kOps K8s Control Plane Monitoring with Datadog

Xing Du
5 min read · Sep 15, 2022

Context

Due to a lack of control plane observability, I recently re-integrated the Datadog helm chart on our kops-provisioned k8s cluster. kops is definitely not the most popular k8s solution, and the official control plane monitoring guide doesn't cover detailed steps for it.

Throughout the process, I ran into one major issue (details later), potentially caused by a compatibility problem between kops and datadog-agent. The investigation kept me busy, and I still don't have a definitive answer on how to "fix" it. However, I came up with a way to bypass the issue and ensure full visibility coverage for the control plane.

Overview

A k8s control plane has 4 major components:

  • kube-apiserver
  • etcd
  • kube-scheduler
  • kube-controller-manager

all of which are supported by native Datadog integrations (which come with datadog-agent). The recommended integration guide relies on Kubernetes integration autodiscovery, but that does not work on kops-provisioned control planes.

I’ll walk through the issue and findings, and follow up with a step-by-step guide on how to bypass it.

The details covered here are based on the following setup:

For brevity, I’ll refer to “kops-provisioned control plane node(s)” as “control node(s)” unless explicitly specified.

Problem

I’ll use kube-scheduler to illustrate the problem (the same issue applies to all 4 components).

Example integration (values.yaml for datadog helm chart):

datadog:
  apiKey: <DATADOG_API_KEY>
  ...
  ignoreAutoConfig:
    - kube_scheduler
  ...
  confd:
    kube_scheduler.yaml: |-
      ad_identifiers:
        - kube-scheduler
      instances:
        - prometheus_url: https://%%host%%:10259/metrics
          ssl_verify: false
          bearer_token_auth: true

This is the recommended approach from the official control plane monitoring guide, and it relies on Kubernetes integration autodiscovery.

On a control node, the configuration above does NOT turn on the integration(s), as the check below shows:

  • a valid configuration file for the integration exists under /etc/datadog-agent/conf.d/
  • the integration is not reported as running in the agent status output
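
For example, you can confirm both points by exec-ing into the agent pod scheduled on the control node (the namespace and pod name below are placeholders for your deployment; the container is named agent in the datadog helm chart daemonset):

# The integration's config file is present...
kubectl exec -n <datadog-namespace> <agent-pod-on-control-node> -c agent -- ls /etc/datadog-agent/conf.d/
# ...but the check never shows up as running in the agent status output.
kubectl exec -n <datadog-namespace> <agent-pod-on-control-node> -c agent -- agent status | grep -i -A 3 kube_scheduler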

Investigation and Findings

The first thing I checked was the configuration content itself: I walked through the helm chart values reference and did not see anything wrong.

I’ve previously set up the Datadog helm chart for control plane monitoring on EKS clusters and on docker-desktop / minikube. There, the identical configuration doesn't work 100%, but at least the integrations are detected correctly via autodiscovery. The container names I saw when running docker ps on the control nodes have the right short name & image name (which datadog-agent uses to derive the ad_identifier), so I'm confident the configuration (especially the ad_identifiers section) is not the problem.

The next thing I did was turn on debug logging (datadog.logLevel: debug; logs available at /var/log/datadog/agent.log) for the datadog helm chart on both my kops cluster and a docker-desktop / minikube cluster. From the debug logs I pieced together roughly how datadog-agent autodiscovery works:

  • file-based configurations (/etc/datadog-agent/conf.d/) are loaded into memory, and running containers & processes are detected
  • each detected container/process gets an identifier, which is compared against the configurations of integrations that have autodiscovery turned on (via ad_identifiers)
  • once an ad_identifier matches, the rest of that yaml configuration is used for the integration.

Each step of the process above can be verified from the debug log.
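
For example, on a control node you can grep the debug log for the scheduler and for the identifiers the agent discovered (the log path comes from the debug setup above; exact log wording varies by agent version, so adjust the patterns as needed):

# Run inside the agent container on a control node.
grep -iE "kube[-_]scheduler" /var/log/datadog/agent.log
grep -i "docker://" /var/log/datadog/agent.log | less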

On a control node, the desired container (kube-scheduler; same for the other 3 components) is NOT identified as kube-scheduler. I noticed many containers were identified by container id (in the format "docker://<container_id>"), but none of those container ids matched the actual container id of kube-scheduler (you can find the container id with kubectl describe pod/<kube-scheduler-pod-name>, or by SSHing to the control node and running docker ps).
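
A quick way to cross-check that container id (the pod name below follows the static-pod naming on my cluster; adjust for yours):

# Container id as Kubernetes reports it
kubectl describe pod -n kube-system kube-scheduler-<control-node-name> | grep "Container ID"
# Container id as Docker on the control node reports it (run over SSH on the node)
docker ps --no-trunc | grep kube-scheduler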

Either kube-scheduler (same for the other 3 components) is not detected at all, or it is detected under a container id that doesn't match its own.

This is where I realized there was nothing further actionable with this approach. Fortunately, my goal is to get the integrations working for control nodes one way or another, and I was able to come up with an alternative solution.

Solution

The TL;DR version of the solution: use file-based configuration without autodiscovery.

Integrations are driven by configuration files (located under /etc/datadog-agent/conf.d/). The helm-native approach mentioned above works by converting the datadog.confd key-value pairs into one auto_conf.yaml per integration. The non-helm way to configure an integration is to provision your own conf.yaml files.
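
Concretely, the agent loads one conf.yaml per integration directory under conf.d/; the rest of this section provisions exactly these four files through a ConfigMap:

/etc/datadog-agent/conf.d/
  etcd.d/conf.yaml
  kube_apiserver_metrics.d/conf.yaml
  kube_controller_manager.d/conf.yaml
  kube_scheduler.d/conf.yaml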

To bypass the autodiscovery issue on a kops-provisioned cluster, we can:

  • provision a ConfigMap with the desired configurations
  • mount the ConfigMap as volume(s) into datadog-agent: agents.volumes + agents.volumeMounts
  • replace template variables with ones that resolve
  • disable auto-config (a synonym for “autodiscovery”) for the integrations: datadog.ignoreAutoConfig

Datadog configuration k8s ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-datadog-configmap
data:
  kube_apiserver_metrics.yaml: |+
    init_config:
    instances:
      - prometheus_url: https://%%env_DD_KUBERNETES_KUBELET_HOST%%:443/metrics
        tls_verify: false
        bearer_token_auth: true
        bearer_token_path: /var/run/secrets/kubernetes.io/serviceaccount/token
  etcd.yaml: |+
    init_config:
    instances:
      # etcd-manager-main
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4001/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key
      # etcd-manager-events
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4002/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.key
  kube_scheduler.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10251/metrics"
        ssl_verify: false
  kube_controller_manager.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10252/metrics"
        ssl_verify: false

Explanation

  • template variables are specific to the autodiscovery feature. In non-autodiscovery configurations, not all of them can be resolved; e.g. %%host%% does not resolve. Fortunately, %%env_<ENV_VAR>%% seems to resolve fine.
  • kops provisions 2 etcd clusters: main and events. Two instances of the etcd integration are required, with slightly different tls_cert and tls_private_key values (although I've verified these are interchangeable).
  • kops uses etcd-manager as the parent process for etcd. Ports 2380/2381 are for peer (server-to-server) communication, and 4001/4002 are for client (client-to-server) communication. Since the agent acts as a "client" of the etcd server, ports 4001/4002 are the right choice (instead of port 2379 in a vanilla etcd setup); a quick curl check against these ports is sketched after this list.
  • kube-scheduler serves HTTP on port 10251 and HTTPS on port 10259
  • kube-controller-manager serves HTTP on port 10252 and HTTPS on port 10257
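
As a sanity check (outside of the agent itself), you can hit the etcd-manager-main client endpoint from a control node with the same certificates the integration uses; note these are the host paths, not the /host/... paths seen inside the agent container:

# Run on a control node; swap port 4002 and the etcd-manager-events paths for the events cluster.
curl -sk \
  --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt \
  --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key \
  https://127.0.0.1:4001/metrics | head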

values.yaml for datadog helm chart

datadog:
  ignoreAutoConfig:
    - etcd
    - kube_scheduler
    - kube_controller_manager
    - kube_apiserver_metrics
agents:
  volumes:
    - name: my-config
      configMap:
        name: my-datadog-configmap
    - name: etcd-pki
      hostPath:
        path: /etc/kubernetes/pki
  volumeMounts:
    - name: etcd-pki
      mountPath: /host/etc/kubernetes/pki
      readOnly: true
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_apiserver_metrics.d/conf.yaml
      subPath: kube_apiserver_metrics.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/etcd.d/conf.yaml
      subPath: etcd.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_scheduler.d/conf.yaml
      subPath: kube_scheduler.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_controller_manager.d/conf.yaml
      subPath: kube_controller_manager.yaml

Explanation

  • Certificates and private keys (located under /etc/kubernetes/pki on the host) are required for etcd client-to-server communication, which is why that directory is mounted read-only into the agent at /host/etc/kubernetes/pki.
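
With the ConfigMap and values in place, a minimal rollout looks something like this (release name, namespace, and file names are placeholders for whatever your setup uses):

# Apply the ConfigMap, then upgrade the Datadog release with the values above.
kubectl apply -n <datadog-namespace> -f my-datadog-configmap.yaml
helm upgrade --install datadog datadog/datadog -n <datadog-namespace> -f values.yaml

# Verify on an agent pod scheduled on a control node: all four checks should
# now be reported as running in the agent status output.
kubectl exec -n <datadog-namespace> <agent-pod-on-control-node> -c agent -- agent status | grep -E "etcd|kube_scheduler|kube_controller_manager|kube_apiserver_metrics"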
